Conversation

@DiTo97 (Collaborator) commented May 10, 2024

before merging, I think we should make sure that:

  • the new Chromium web driver is what we want to have
  • the fetch node testing functions still work properly

I cannot run Ollama until the end of the weekend, so someone else should take care of testing the fetch node functionality.

DiTo97 and others added 7 commits May 10, 2024 21:05
…functionality for proxy rotation

the broker has been made fully configurable for anonymity level, admissible locations, scheme, and max shape (so as not to waste resources), unlike the original `free-proxy` package.

other options have been explored (e.g., `proxybroker`, `proxybroker2`) due to their built-in proxy server and rotation capabilities, but the former is no longer maintained, and the latter has issues with any Python version outside of Python 3.9.
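The configurable filtering described above can be sketched roughly as follows. The parameter names mirror those visible elsewhere in this PR (`anonymous`, `countryset`, `secure`, `max_shape`), but the `Candidate` record and the filtering body are illustrative assumptions, not the merged implementation (the real broker queries live proxy lists):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    # Illustrative stand-in for a proxy record returned by a broker
    address: str
    anonymous: bool
    country: str
    secure: bool  # True if the proxy supports HTTPS

def search_proxy_servers(candidates, anonymous=True, countryset=None,
                         secure=False, max_shape=5):
    """Filter candidate proxies by the requested criteria.

    Stops as soon as `max_shape` servers are found, so no resources are
    wasted collecting more proxies than the caller needs.
    """
    found = []
    for c in candidates:
        if anonymous and not c.anonymous:
            continue
        if countryset is not None and c.country not in countryset:
            continue
        if secure and not c.secure:
            continue
        found.append(c.address)
        if len(found) >= max_shape:
            break
    return found
```

A caller wanting a single anonymous Italian proxy would then use something like `search_proxy_servers(pool, countryset={"IT"}, max_shape=1)`.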
…eb driver with proxy protection and flexible kwargs and backend

the original class prevents passing kwargs down to the Playwright backend, making some configurations unfeasible, including passing a proxy server to the web driver.

the new class is backward compatible with the original, but 1) allows any kwarg to be passed down to the web driver, 2) allows specifying the web driver backend (only Playwright is supported for now), in case more (e.g., Selenium) are supported in the future, and 3) automatically fetches a suitable proxy if one is not passed already
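The three behaviors above could look roughly like the sketch below. The class name, attribute names, and the stubbed proxy lookup are illustrative assumptions, not the merged code (the real lookup would query the proxy broker):

```python
class ChromiumLoader:
    """Sketch of a loader that forwards kwargs and auto-fetches a proxy."""

    SUPPORTED_BACKENDS = ("playwright",)  # e.g., "selenium" could be added later

    def __init__(self, urls, backend="playwright", proxy=None, **kwargs):
        # 2) the backend is selectable, with only Playwright supported for now
        if backend not in self.SUPPORTED_BACKENDS:
            raise ValueError(f"unsupported backend: {backend!r}")
        self.urls = urls
        self.backend = backend
        # 3) auto-fetch a suitable proxy when none is passed (stubbed here)
        self.proxy = proxy if proxy is not None else self._search_proxy()
        # 1) any remaining kwargs are forwarded verbatim to the backend
        self.backend_kwargs = kwargs

    def _search_proxy(self):
        # Placeholder: the real class would ask the proxy broker for a server
        return {"server": "http://127.0.0.1:8899"}
```

Because unknown kwargs are stored rather than rejected, options like `headless=False` reach the backend without the wrapper having to know about them.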
proxies = search_proxy_servers(
    anonymous=True,
    countryset={"IT"},
    # secure=True,
)
@DiTo97 (Collaborator, Author) commented:

I guess `secure=True` was commented out because no secure proxies were being found.

that's something I'd document clearly: high security and low timeouts are inversely related (requiring secure proxies makes finding one within a short timeout much less likely)

A 'playwright' compliant proxy configuration.
"""
- server = search_proxy_servers(max_shape=1, **proxy.get("criteria", {}))[0]
+ server = search_proxy_servers(**proxy.get("criteria", {}))[0]
@DiTo97 (Collaborator, Author) commented:

the only reason I had forced max_shape=1 is that the function will only ever be called by the fetch node; users seeking proxies might as well call search_proxy_servers directly with the desired parameters.

_search_proxy is called only by parse_or_search_proxy, which was designed for the fetch node and specifically generates a single proxy server satisfying the criteria (the fetch node won't use more than one)

Contributor replied:

I see; I had removed it since it conflicted with the max_shape specified by the user, if present. I have added it back, now removing the user-specified max_shape if present. I will make a new merge.
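The resolution agreed above (pin `max_shape=1` for the fetch node, discarding any user-supplied value so the two never conflict) could be sketched as follows; the helper name is hypothetical:

```python
def force_single_proxy(criteria: dict) -> dict:
    """Return a copy of the search criteria with max_shape forced to 1.

    Any user-specified max_shape is dropped first, since the fetch node
    uses exactly one proxy server regardless of what the user requested.
    """
    cleaned = {k: v for k, v in criteria.items() if k != "max_shape"}
    cleaned["max_shape"] = 1
    return cleaned
```

The forced value can then be spread into the broker call, e.g. `search_proxy_servers(**force_single_proxy(proxy.get("criteria", {})))`.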

@PeriniM (Contributor) commented May 13, 2024

Thanks! Writing docs

@PeriniM PeriniM merged commit b8079f8 into ScrapeGraphAI:pre/beta May 13, 2024
@github-actions

🎉 This PR is included in version 0.11.0-beta.5 🎉

The release is available on:

Your semantic-release bot 📦🚀

@github-actions

🎉 This PR is included in version 0.11.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
